Systems Biology Lab
2024-11-01
You may have heard this statement:
Correlation does not imply causation
Here we state that it does:
We will get back to this in January in the course Biosystems Data Analysis
The p-value of a correlation test on these data equals 0.0079.
It is obvious why many call an accidental correlation a spurious correlation, spurious meaning false or fake.
However, why do they also call the correlation between storks and birth rates a spurious correlation? What aspect is false or fake about this correlation?
Do you agree that both cases are often lumped together under the term spurious correlation?
We are easily tricked into drawing false conclusions. Even trained scientists get caught in this trap …
microbiome \(\longrightarrow\) autism
autism \(\longrightarrow\) dietary preference \(\longrightarrow\) microbiome
The controversy continues …
(note the second author of this paper and the one above)
A drug is tested on a group of 800 persons. This is the outcome:
When split according to sex
Should we treat or not treat?
Petabytes allow us to say: “Correlation is enough.”
Chris Anderson, “The End of Theory”, WIRED Magazine
\(X\) causes \(Y\) (Varying \(X\) leads to variation in \(Y\))
\(Y\) causes \(X\) (Varying \(Y\) leads to variation in \(X\))
With three random variables there are three possible patterns
What can we say about the joint probability distributions \(P(Y,Z)\)?
What can we say about the conditional joint probability distributions \(P(Y,Z|X)\)?
We have seen dependence/independence of variables from a joint distribution
Independence, \(X \perp Y\):
\[ P(X,Y) = P(X) \cdot P(Y) \quad \text{or} \;\; P(Y|X) = P(Y) \]
Dependence \(X \not\perp Y\):
\[ P(X,Y) \neq P(X) \cdot P(Y) \quad \text{or} \;\; P(Y|X) \neq P(Y) \]
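For discrete variables these definitions can be checked directly from counts. A minimal base-R sketch (the two variables and their probabilities are invented for illustration):

```r
# Two variables sampled independently of each other
set.seed(1)
n <- 1e5
x <- sample(0:1, n, replace = TRUE, prob = c(0.3, 0.7))
y <- sample(0:1, n, replace = TRUE, prob = c(0.6, 0.4))

# Empirical joint and marginal distributions
p_xy <- table(x, y) / n
p_x  <- table(x) / n
p_y  <- table(y) / n

# Independence: P(X,Y) should equal P(X) * P(Y) up to sampling noise
max(abs(p_xy - outer(p_x, p_y)))
```

For dependent variables the same comparison shows a clear discrepancy between \(P(X,Y)\) and \(P(X) \cdot P(Y)\).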
Now it is time to define conditional independence and conditional dependence
Important: we consider joint distributions of at least three random variables to define these new concepts
Conditional probability in the presence of a third variable
\[ P(X,Y\,|\,Z) = \frac{P(X,Y,Z)}{P(Z)} \]
Two variables are independent conditionally on a third variable (conditionally independent) when the following holds for their joint distribution:
\[ P(X,Y \,|\, Z) = P(X \,|\, Z) \cdot P(Y \,|\, Z) \]
Or briefly \(X \perp Y \,|\, Z\)
\[ P(Y\,|\,X) = \frac{P(X,Y)}{P(X)} \]
Example: \(P(Y\,|\,X=450)\)
According to the scheme
\[ P(X,Y \,|\, Z) = P(X \,|\, Z) \cdot P(Y \,|\, Z) \]
Examples: at \(Z=9\) and \(Z=34\)
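The confounder pattern can also be checked numerically. In this sketch (all variables invented) \(Z\) is a common cause of \(X\) and \(Y\); for jointly Gaussian variables, zero partial correlation given \(Z\) amounts to conditional independence, and we can compute it by regressing \(Z\) out:

```r
# Z -> X and Z -> Y, no direct edge between X and Y
set.seed(42)
n <- 5000
z <- rnorm(n)
x <-  2.0 * z + rnorm(n)
y <- -1.5 * z + rnorm(n)

cor(x, y)   # clearly non-zero: X and Y are marginally dependent

# Condition on Z: keep only the parts of X and Y not explained by Z
rx <- resid(lm(x ~ z))
ry <- resid(lm(y ~ z))
cor(rx, ry) # near zero: X and Y are conditionally independent given Z
```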
Is the proposed causal scheme (\(Z\) as a confounder or common cause) the only causal scheme explaining conditional independence?
No. We will also obtain conditional independence of \(X\) and \(Y\) if \(Z\) is a mediator:
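A quick simulation (invented variables again) confirms that a mediator leaves the same footprint in the data as a confounder:

```r
# Chain X -> Z -> Y: X influences Y only through the mediator Z
set.seed(7)
n <- 5000
x <- rnorm(n)
z <- 1.5 * x + rnorm(n)
y <- 2.0 * z + rnorm(n)

cor(x, y)   # strongly non-zero: X and Y are dependent

rx <- resid(lm(x ~ z))
ry <- resid(lm(y ~ z))
cor(rx, ry) # near zero: X and Y are conditionally independent given Z
```

The two causal schemes are therefore indistinguishable from these conditional (in)dependencies alone.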
Data alone do not provide the causal direction
\[ \begin{align} \text{Season} &\longrightarrow \text{Fledglings}\quad \Rightarrow \\ \text{Season} &\not\perp \text{Fledglings} \end{align} \]
\[ \begin{align} \text{Season} & \longrightarrow \text{Food} \longrightarrow \text{Fledglings} \quad \Rightarrow \\ \text{Season} & \not\perp \text{Fledglings} \end{align} \]
\[ \text{Season} \perp \text{Fledglings}\;|\;\text{Food} \]
# Generating mock data
library(tidyverse)  # for tibble and the %>% pipe
set.seed(123)
n <- 20
seasons <- c('Spring','Summer','Autumn')
N <- n*length(seasons)
food <- c(8,2.4,0.2)
names(food) <- seasons
d1 <- tibble(
Season = factor(rep(seasons, n), levels=seasons)
)
d1 <- d1 %>%
mutate(Food = food[Season] * exp(rnorm(N,sd=1))) %>%
mutate(Young = Food*exp(rnorm(N, sd=1))) %>%
mutate(lgFood = log(Food), lgYoung = log(Young))

Residuals ~ Season

| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| Season | 2 | 0.871 | 0.436 | 0.591 | 0.557 |
| Residuals | 57 | 42.023 | 0.737 | NA | NA |
Comparing lgYoung ~ lgFood and lgYoung ~ lgFood + Season

| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| lgYoung ~ 1 | 59 | 278.298 | NA | NA | NA | NA |
| lgYoung ~ lgFood | 58 | 42.894 | 1 | 235.404 | 313.901 | 0.000 |
| lgYoung ~ lgFood + Season | 56 | 41.996 | 2 | 0.898 | 0.598 | 0.553 |
Since adding Season as a predictor does not significantly improve the model fit compared to using Food alone, the causal diagram is in agreement with the data.
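The comparison in the table can be reproduced with nested linear models and `anova()`. A self-contained base-R version of the same mock-data scheme (numbers will differ slightly from the tidyverse pipeline above because the random draws are ordered differently):

```r
set.seed(123)
n <- 20
seasons   <- c("Spring", "Summer", "Autumn")
food_mean <- c(Spring = 8, Summer = 2.4, Autumn = 0.2)

Season  <- factor(rep(seasons, n), levels = seasons)
Food    <- food_mean[as.character(Season)] * exp(rnorm(length(Season)))
Young   <- Food * exp(rnorm(length(Season)))
lgFood  <- log(Food)
lgYoung <- log(Young)

m1 <- lm(lgYoung ~ lgFood)
m2 <- lm(lgYoung ~ lgFood + Season)
anova(m1, m2)  # if Food fully mediates, Season should not improve the fit
```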
# Generating mock data
set.seed(1234)
predation <- c(0.2,4,6)
names(predation) <- seasons
d2 <- tibble(
Season = factor(rep(seasons, n), levels=seasons)
)
d2 <- d2 %>%
mutate(Food = food[Season] * exp(rnorm(N, sd=1))) %>%
mutate(Predation = predation[Season] * exp(rnorm(N, sd=1))) %>%
mutate(Young = (Food/Predation)*exp(rnorm(N, sd=1))) %>%
mutate(lgFood = log(Food), lgYoung = log(Young), lgPredation = log(Predation))

Residuals ~ Season

| term | df | sumsq | meansq | statistic | p.value |
|---|---|---|---|---|---|
| Season | 2 | 83.976 | 41.988 | 15.571 | 0 |
| Residuals | 57 | 153.702 | 2.697 | NA | NA |
Comparing lgYoung ~ lgFood and lgYoung ~ lgFood + Season

| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| lgYoung ~ 1 | 59 | 636.595 | NA | NA | NA | NA |
| lgYoung ~ lgFood | 58 | 237.678 | 1 | 398.917 | 189.021 | 0 |
| lgYoung ~ lgFood + Season | 56 | 118.184 | 2 | 119.494 | 28.310 | 0 |
Since adding Season as a predictor significantly improves the model fit compared to using Food alone, the causal diagram does not agree with the data. In particular, there may be a second route from Season to Fledglings.
| Treatment | Not recovered | Recovered | Total | Recovery rate |
|---|---|---|---|---|
| Not treated | 24 | 16 | 40 | 40% |
| Treated | 20 | 20 | 40 | 50% |
When splitting the result according to sex
| Sex | Treatment | Not recovered | Recovered | Total | Recovery rate |
|---|---|---|---|---|---|
| Female | Not treated | 21 | 9 | 30 | 30% |
| Female | Treated | 8 | 2 | 10 | 20% |
| Male | Not treated | 3 | 7 | 10 | 70% |
| Male | Treated | 12 | 18 | 30 | 60% |
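The reversal in these tables, Simpson's paradox, can be verified directly from the counts above with a few lines of base R:

```r
# Recovered / total, taken from the tables above
recovered <- c(F_untreated = 9,  F_treated = 2,  M_untreated = 7,  M_treated = 18)
total     <- c(F_untreated = 30, F_treated = 10, M_untreated = 10, M_treated = 30)

# Within each sex, treatment lowers the recovery rate
rate <- recovered / total   # 0.30 0.20 0.70 0.60

# Pooled over sex, treatment appears to raise it
pooled_untreated <- (recovered["F_untreated"] + recovered["M_untreated"]) /
                    (total["F_untreated"] + total["M_untreated"])   # 0.40
pooled_treated   <- (recovered["F_treated"] + recovered["M_treated"]) /
                    (total["F_treated"] + total["M_treated"])       # 0.50
```

The pooled comparison is confounded because males both recover more often and are treated more often.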
A backdoor is created by a confounder: an unblocked path (a path without a collider) connecting the potential cause and the potential effect, with arrows pointing into both of them.
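The data behind the treatment example are not shown, but the backdoor structure (Sex → Treatment, Sex → outcome, Treatment → outcome) is easy to simulate; all coefficients here are invented for illustration:

```r
set.seed(2024)
n <- 2000
male <- rbinom(n, 1, 0.5)
# Backdoor: sex affects both who gets treated and the outcome
treatment <- rbinom(n, 1, ifelse(male == 1, 0.75, 0.25))
y <- 80 + 15 * male - 10 * treatment + rnorm(n, sd = 10)

coef(lm(y ~ treatment))["treatment"]        # biased upward: absorbs the sex effect
coef(lm(y ~ treatment + male))["treatment"] # close to the true effect of -10
```

Adding the confounder as a covariate blocks the backdoor, which is what the model comparison below does.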
Testing model y ~ treatment
Analysis of Variance Table
Response: y
Df Sum Sq Mean Sq F value Pr(>F)
treatment 1 248.6 248.65 1.7854 0.183
Residuals 198 27574.8 139.27
How to condition on Sex?
Graphically as well as using linear modeling and ANOVA
Compare y ~ treatment and y ~ treatment + sex

| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| y ~ treatment | 198 | 27574.8 | NA | NA | NA | NA |
| y ~ treatment + sex | 197 | 18347.1 | 1 | 9227.682 | NA | 0 |
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 85.7 | 1.0 | 82.3 | 0 |
| treatmentTRUE | -10.1 | 1.6 | -6.4 | 0 |
| sexM | 15.7 | 1.6 | 10.0 | 0 |
What else should we investigate in these data?
How do we know that the causal relation is not like this? Do the data tell us anything about this?
There is no way of knowing this just from the data. We need to have additional information about the experiment and/or prior knowledge about the system.
Or, using an instrumental variable (see syllabus).
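The instrumental-variable route mentioned here can be sketched with two-stage least squares in base R. Everything below is invented for illustration (in practice one would use a dedicated implementation such as `ivreg` in the AER package, which also gets the standard errors right):

```r
set.seed(11)
n <- 5000
u <- rnorm(n)                  # unobserved confounder of x and y
z <- rnorm(n)                  # instrument: affects x, and y only through x
x <- z + u + rnorm(n)
y <- 2 * x + 3 * u + rnorm(n)  # true causal effect of x on y is 2

coef(lm(y ~ x))["x"]           # ordinary regression is biased by u

# Stage 1: predict x from the instrument; Stage 2: regress y on that prediction
x_hat <- fitted(lm(x ~ z))
coef(lm(y ~ x_hat))["x_hat"]   # close to the true effect 2
```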
\[ \begin{align} \text{Species X} &\perp \text{Disease} \\ \text{Species X} &\not \perp \text{Disease} \;|\; \text{Cholesterol} \end{align} \]
The full correlation network
The simplified correlation network
Show, using linear models and ANOVA, that the causal scheme shown in slide 1.9 is in accordance with the data. In particular:
- Surface is a good predictor of Birth_rate.
- Storks provides no additional information about Birth_rate beyond what Surface alone does.
- Humans (population size) is a good predictor of Birth_rate.
- Surface provides no additional information about Birth_rate beyond what Humans alone does.

Are Storks a cause for Birth_rate?

The data are available here: https://few.vu.nl/~molenaar/courses/data/causalinference/storks_and_birth_rate.csv
d <- read_csv('https://few.vu.nl/~molenaar/courses/data/causalinference/storks_and_birth_rate.csv', comment="#")
lm0 <- lm(Birth_rate ~ 1, data=d)
lm1 <- lm(Birth_rate ~ Surface, data=d)
lm2 <- lm(Birth_rate ~ Surface + Storks, data=d)
lm3 <- lm(Birth_rate ~ Humans, data=d)
lm4 <- lm(Birth_rate ~ Humans + Surface, data=d)
lm5 <- lm(Birth_rate ~ Humans + Surface + Storks, data=d)
lm6 <- lm(Birth_rate ~ Storks, data=d)
a1 <- anova(lm0,lm1,lm2)
a2 <- anova(lm0,lm3,lm4,lm5) # Surface adds information on top of Humans
a3 <- anova(lm0,lm6,lm2) # Data alone does not rule out effect of Storks on Birth_rate
a1 |>
broom::tidy() |>
flextable::flextable() |>
flextable::autofit()

| term | df.residual | rss | df | sumsq | statistic | p.value |
|---|---|---|---|---|---|---|
| Birth_rate ~ 1 | 16 | 2,690,208.0 | | | | |
| Birth_rate ~ Surface | 15 | 400,603.0 | 1 | 2,289,604.0 | 86.0 | 0.0 |
| Birth_rate ~ Surface + Storks | 14 | 370,796.0 | 1 | 29,807.0 | 1.0 | 0.3 |
- Surface is a fair predictor of Birth_rate.
- Conditioning on Surface means including it as a predictor in the statistical model.
- Conditioning on Surface, we see from the ANOVA that Storks provides no additional predictive value.
- Besides the effect of Surface through Humans, there could be an additional direct effect of Surface on Birth_rate.

A popular account of causal inference: